Abstract. This paper focuses on reinforcement learning (RL) with limited
prior knowledge. In the domain of swarm robotics, for instance, the expert
can hardly design a reward function or demonstrate the target behavior,
which precludes both standard RL and inverse reinforcement learning. Even
with limited expertise, the human expert is often still able to express
preferences and rank the agent's demonstrations.
Earlier work has presented an iterative preference-based RL framework: 
expert preferences are exploited to learn an approximate policy return, 
thus enabling the agent to achieve direct policy search. Iteratively, the 
agent selects a new candidate policy and demonstrates it; the expert 
ranks the new demonstration comparatively to the previous best one; 
the expert's ranking feedback enables the agent to refine the approxi- 
mate policy return, and the process is iterated. 

In this paper, preference-based reinforcement learning is combined with 
active ranking in order to decrease the number of ranking queries to the 
expert needed to yield a satisfactory policy. Experiments on the mountain
car and the cancer treatment testbeds show that a couple of dozen rankings
suffice to learn a competent policy.

Keywords: reinforcement learning, preference learning, interactive op- 
timization, robotics 

1 Introduction 

Reinforcement learning (RL) [26,27] raises a main issue, that of the prior
knowledge needed to efficiently converge toward a (nearly) optimal policy.
Prior knowledge can be conveyed through the smart design of the state and
action spaces, addressing the limited scalability of RL algorithms. The human
expert can directly demonstrate some optimal or nearly-optimal behavior,
speeding up the acquisition of an appropriate reward function and/or the
exploration of the RL search space through inverse reinforcement learning [23],
learning by imitation [S], or learning by demonstration [inj. The use of
preference learning, allegedly less demanding for the expert than inverse
reinforcement learning, has also been investigated in RL, to learn either a
reward function or a policy return function [2]. In the latter approach,
referred to as preference-based policy learning and motivated by swarm
robotics, the expert is unable to design a reward function or demonstrate an
appropriate behavior; the expert is rather a knowledgeable person, only able to
judge and rank the behaviors demonstrated by the learning agent. Like inverse
reinforcement learning, preference-based policy learning learns a policy
return; but demonstrations rely only on the learning agent, while the expert
provides feedback by expressing preferences and ranking the demonstrated
behaviors (section 5).

Building on the preference-based policy learning (Ppl) approach [2], the
contribution of the present paper is to extend Ppl along the lines of active
learning, in order to minimize the number of expert ranking feedbacks needed to
learn a satisfactory policy. Our primary goal, however, is to learn a competent
policy; learning an accurate policy return is but a means to learn an accurate
policy. More than active learning per se, our goal thus relates to interactive
optimization [4] and online recommendation [30]. The Bayesian settings used in
these related works (section 2.4) will inspire the proposed Active
Preference-based Reinforcement Learning (April) algorithm.

The difficulty is twofold. Firstly, the above Bayesian approaches hardly scale
up to large-dimensional continuous spaces. Secondly, the Ppl setting requires
one to consider two different search spaces. Basically, RL is a search problem
on the policy space X, mappings of the state space on the action space. However,
the literature underlines that complex policies can hardly be expressed in the
state $\times$ action space for tractability reasons [21]. A thoroughly
investigated alternative is to use parametric representations (see e.g. [25]
among many others), for instance using the weight vectors of a neural net as
policy search space $X$ ($X \subset \mathbb{R}^d$, with $d$ in the order of
thousands). Unfortunately, earlier experiments suggest that parametric policy
representations might be ill-suited to learn a preference-based policy return
[21]. The failure to learn an accurate preference-based policy return on the
parametric space is explained as follows: the expert's preferences essentially
relate to the policy behavior, on the one hand, and the policy behavior depends
in a highly non-smooth way on its parametric description, on the other hand.
Indeed, small modifications of a neural weight vector $x$ can entail arbitrarily
large differences in the behavior of policy $\pi_x$, depending on the robot
environment (the tendency to turn right or left in front of an obstacle might
have a far-reaching impact on the overall robot behavior).

The Ppl framework thus requires one to simultaneously consider the parametric
representation of policies (the primary search space) and the behavioral
representation of policies (where the policy return, a.k.a. objective to be
optimized, can be learned accurately). The distinction between the parametric
and the behavioral spaces is reminiscent of the distinction between the input
and the feature spaces, at the core of the celebrated kernel trick [7].
Contrasting with the kernel framework however, the mapping $\Phi$ (mapping the
parametric representation $x$ of policy $\pi_x$ onto the behavioral description
$\Phi(x)$ of policy $\pi_x$) is non-smooth.^1 In order for Ppl to apply the
abovementioned Bayesian approaches used in interactive optimization [4] or
online recommendation [30], where the objective function is defined on the
search space, one should thus solve the inverse parametric-to-behavioral mapping
problem and compute $\Phi^{-1}$. However, computing such inverse mappings is
notoriously difficult in general [28]; it is even more so in the RL setting as
it boils down to inverting the generative model.

^1 Interestingly, policy gradient methods face the same difficulties, and the
guarantees they provide rely on the assumption of a smooth $\Phi$ mapping [25].

The technical contribution of the paper, at the core of the April algorithm,
is to propose a tractable approximation of the Bayesian setting used in [4,30],
consistent with the parametric-to-behavioral mapping. The robustness of the
proposed approximate active ranking criterion is first assessed on an artificial
problem. Its integration within April is thereafter studied and a proof of
concept of April is given on the classical mountain car problem, and the cancer
treatment testbed first introduced by [32].

This paper is organized as follows. Section 2 briefly presents Ppl for
self-containedness and discusses work related to preference-based reinforcement
learning and active preference learning. Section 3 gives an overview of April.
Section 4.2 is devoted to the empirical validation of the approach and the paper
concludes with some perspectives for further research.

2 State of the art 

This section briefly introduces the notations used throughout the paper,
assuming the reader's familiarity with reinforcement learning and referring to
[26] for a comprehensive presentation. Preference-based policy learning, first
presented in [2], is thereafter described for the sake of self-containedness,
and discussed with respect to inverse reinforcement learning [1,18] and
preference-based value learning [6]. Lastly, the section introduces related work
in active ranking, specifically in interactive optimization and online
recommendation.

2.1 Formal background 

Reinforcement learning classically considers a Markov decision process
framework $(S, A, p, r, \gamma, q)$, where $S$ and $A$ respectively denote the
state and the action spaces, $p$ is the transition model ($p(s, a, s')$ being
the probability of being in state $s'$ after selecting action $a$ in state $s$),
$r : S \to \mathbb{R}$ is a bounded reward function, $0 < \gamma < 1$ is a
discount factor, and $q : S \to [0, 1]$ is the initial state probability
distribution. To each policy $\pi$ ($\pi(s, a)$ being the probability of
selecting action $a$ in state $s$) is associated the policy return $J(\pi)$,
the expected discounted reward collected by $\pi$ over time:

\[ J(\pi) = \mathbb{E}_{\pi,\, s_0 \sim q} \Big[ \sum_{h=0}^{\infty} \gamma^h r(s_h) \Big] \]

RL aims at finding the optimal policy $\pi^* = \arg\max J(\pi)$. Most RL
approaches, including the famed value and policy iteration algorithms, rely on
the fact that a value function $V^\pi : S \to \mathbb{R}$ can be defined from
any policy $\pi$, and that a policy $\mathcal{G}(V)$ can be greedily defined
from any value function $V$:

\[ V^\pi(s) = r(s) + \gamma \sum_{a, s'} \pi(s, a)\, p(s, a, s')\, V^\pi(s') \qquad (1) \]

\[ \mathcal{G}(V)(s) = \arg\max \{ V(p(s, a)),\ a \in A \} \qquad (2) \]

Value and policy iteration algorithms, alternately updating the value function
and the policy (Eqs. (1) and (2)), provide convergence guarantees toward the
optimal policy provided that the state and action spaces are visited infinitely
many times [35]. Another RL approach, referred to as direct policy learning [5S],
proceeds by directly optimizing some objective function, a.k.a. policy return,
on the policy space.
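
As an illustration of Eqs. (1) and (2), the following minimal sketch runs
iterative policy evaluation and a greedy improvement step on a hypothetical
two-state, two-action MDP. The toy transition table P, rewards R and uniform
policy PI are assumptions made for the example only; with a stochastic
transition model, the greedy step is read here as maximizing the expected value
of the next state.

```python
def evaluate_policy(p, r, pi, gamma, sweeps=500):
    """Fixed-point iteration of Eq. (1) for a tabular MDP.

    p[s][a][s2] is the transition probability, r[s] the reward,
    pi[s][a] the probability of selecting action a in state s.
    """
    n_states = len(r)
    n_actions = len(pi[0])
    v = [0.0] * n_states
    for _ in range(sweeps):
        v = [
            r[s] + gamma * sum(
                pi[s][a] * p[s][a][s2] * v[s2]
                for a in range(n_actions)
                for s2 in range(n_states)
            )
            for s in range(n_states)
        ]
    return v


def greedy_policy(p, v):
    """Greedy step in the spirit of Eq. (2): best expected next-state value."""
    n_states = len(v)
    policy = []
    for s in range(n_states):
        scores = [
            sum(p[s][a][s2] * v[s2] for s2 in range(n_states))
            for a in range(len(p[s]))
        ]
        policy.append(scores.index(max(scores)))
    return policy


# Hypothetical toy MDP: action 0 stays in place, action 1 switches state;
# only state 1 is rewarding.
P = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
R = [0.0, 1.0]
PI = [[0.5, 0.5], [0.5, 0.5]]
GAMMA = 0.9

V = evaluate_policy(P, R, PI, GAMMA)
GREEDY = greedy_policy(P, V)
```

The greedy policy obtained from the value function of the uniform policy
switches to the rewarding state and then stays there.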

2.2 Preference-based RL 

Preference-based policy learning (Ppl) was designed to achieve RL when the 
reward function is unknown and generative model-based approaches are hardly 
applicable. As mentioned, the motivating application is swarm robotics, where 
simulator-based approaches are discarded for tractability and accuracy reasons, 
and the individual robot reward is not known since the target behavior is defined 
at the collective swarm level. 

Ppl is an iterative 3-step process. During the demonstration step, the robot 
demonstrates a policy; during the ranking step, the expert ranks the new demon- 
stration comparatively to the previous best demonstration; during the self- 
training step, the robot updates its model of the expert preferences, and de- 
termines a new and hopefully better policy. Demonstration and policy trajectory 
or simply trajectory will be used interchangeably in the following. 

Let $U_t = \{u_0, \ldots, u_t;\ (u_{i_1} \prec u_{i_2}),\ i = 1 \ldots t\}$ be
the archive of all demonstrations seen by the expert and all ranking constraints
defined from the expert's preferences up to the $t$-th iteration. A utility
function $J_t$ is defined on the space of trajectories as

\[ J_t(u) = \langle w_t, u \rangle \]

where weight vector $w_t$ is obtained by standard preference learning, solving
the quadratic constrained optimization problem (P) [28, 15]:

\[ \text{Minimize } F(w) = \frac{1}{2}\|w\|_2^2 + C \sum_{1 \le i \le t} \xi_{i_1, i_2} \qquad (P) \]
\[ \text{s.t. for all } 1 \le i \le t: \quad \langle w, u_{i_2} \rangle - \langle w, u_{i_1} \rangle \ge 1 - \xi_{i_1, i_2} \]

Utility $J_t$ defines a policy return on the space of policies, naturally
defined as the expectation of $J_t(u)$ over all trajectories generated from
policy $\pi$ and still noted $J_t$ by abuse of notation:

\[ J_t(\pi) = \mathbb{E}_{u \sim \pi}[\langle w_t, u \rangle] \qquad (3) \]

In [2], the next candidate policy $\pi_{t+1}$ is determined by heuristically
optimizing a weighted sum of the current policy return $J_t$ and the diversity
w.r.t. archive $U_t$. A more principled active ranking criterion is at the core
of the April algorithm (section 3).
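
To make the preference-learning step concrete, here is a minimal sketch that
fits a linear utility from ranked pairs of trajectory descriptions. It is not
the solver used in the paper: it assumes non-negative slacks (the usual hinge
form of a ranking SVM) and uses plain subgradient descent, returning the best
iterate found. The function names, step size and toy pairs are hypothetical.

```python
def dot(w, u):
    return sum(wi * ui for wi, ui in zip(w, u))


def objective(w, pairs, c):
    """0.5 * ||w||^2 + C * sum of hinge slacks; each pair is (worse, better)."""
    slack = sum(max(0.0, 1.0 - (dot(w, b) - dot(w, a))) for a, b in pairs)
    return 0.5 * dot(w, w) + c * slack


def fit_utility(pairs, dim, c=1.0, steps=2000, lr=0.01):
    """Subgradient descent on the hinge form of the ranking problem."""
    w = [0.0] * dim
    best_w, best_f = list(w), objective(w, pairs, c)
    for _ in range(steps):
        grad = list(w)  # gradient of the quadratic term
        for a, b in pairs:
            if dot(w, b) - dot(w, a) < 1.0:  # violated margin: active slack
                for j in range(dim):
                    grad[j] -= c * (b[j] - a[j])
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
        f = objective(w, pairs, c)
        if f < best_f:
            best_w, best_f = list(w), f
    return best_w


# Hypothetical archive: in each pair the second trajectory was preferred.
PAIRS = [
    ((1.0, 0.0), (0.0, 1.0)),
    ((0.5, 0.5), (0.2, 0.8)),
]
W = fit_utility(PAIRS, dim=2)
```

The learned utility of a trajectory u is then simply dot(W, u), and averaging
it over several trajectories of a policy gives an empirical estimate of the
policy return.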

2.3 Discussion 

Let us discuss Ppl with respect to inverse reinforcement learning (IRL) [1,18].
IRL is provided with an informed, feature-based representation of the state
space $S$ (examples of such features $\phi_k(s)$ are the instant speed of the
agent or whether it bumps into a pedestrian in state $s$). IRL exploits the
expert's demonstration $u^* = (s^*_0\, s^*_1\, s^*_2 \ldots s^*_h \ldots)$ to
iteratively learn a linear reward function $r_t(s) = \langle w_t, \phi(s) \rangle$
on the feature space. Interestingly, reward function $r_t$ also defines a
utility function $J_t$ on trajectories: letting $u = (s_0\, s_1 \ldots s_h \ldots)$
be a trajectory,

\[ J_t(u) = \sum_{h=0}^{\infty} \gamma^h r_t(s_h) = \Big\langle w_t, \sum_{h=0}^{\infty} \gamma^h \phi(s_h) \Big\rangle = \langle w_t, \mu(u) \rangle \]

where the $k$-th coordinate of $\mu(u)$ is given by
$\sum_{h=0}^{\infty} \gamma^h \phi_k(s_h)$. As in Eq. (3), a policy return
function on the policy space can be derived by setting $J_t(\pi)$ to the
expectation of $J_t(u)$ over trajectories $u$ generated from $\pi$.

IRL iteratively proceeds by computing optimal policy $\pi_t$ from reward
function $r_t$ (using standard RL [1] or using Gibbs-sampling based exploration
[18]), and refining $r_t$ to enforce that $J_t(\pi_k) < J_t(u^*)$ for
$k = 1 \ldots t$. The process is iterated until reaching the desired
approximation level.
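
The link between a linear reward on state features and a linear utility on
trajectories can be checked numerically. The sketch below, with hypothetical
feature vectors and weights, computes the discounted feature sum of a finite
trajectory and verifies that its scalar product with the weight vector equals
the discounted sum of per-state rewards.

```python
def dot(w, u):
    return sum(wi * ui for wi, ui in zip(w, u))


def discounted_features(state_features, gamma):
    """k-th coordinate: sum over h of gamma^h * phi_k(s_h) (finite horizon)."""
    dim = len(state_features[0])
    total = [0.0] * dim
    discount = 1.0
    for phi in state_features:
        for k in range(dim):
            total[k] += discount * phi[k]
        discount *= gamma
    return total


def discounted_reward(w, state_features, gamma):
    """Sum over h of gamma^h * <w, phi(s_h)>, computed state by state."""
    total, discount = 0.0, 1.0
    for phi in state_features:
        total += discount * dot(w, phi)
        discount *= gamma
    return total


# Hypothetical three-step trajectory described by two features per state.
FEATURES = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
WEIGHTS = (2.0, -1.0)
GAMMA = 0.5

MU = discounted_features(FEATURES, GAMMA)
UTILITY = dot(WEIGHTS, MU)
```

Both routes give the same number, which is why a reward learned on states can
be read as a utility on whole trajectories.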

In summary, the agent iteratively learns a policy return and a candidate 
policy in both IRL and Ppl. The difference is threefold. Firstly, IRL starts with 
an optimal trajectory u* provided by the human expert (which dominates all 
policies built by the agent by construction) whereas Ppl is iteratively provided 
with bits of information (this demonstration is/isn't better than the previous 
best demonstration) by the expert. Secondly, in each iteration IRL solves an RL 
problem using a generative model, whereas Ppl achieves direct policy learning. 
Thirdly, IRL is provided with an informed representation of the state space. 

Let us likewise discuss Ppl w.r.t. preference-based value learning [6]. For
each state $s$, each action $a$ is assessed in [6] by executing the current
policy until reaching a terminal state (rollout). On the basis of these
rollouts, actions are ranked conditionally to $s$ (e.g. $a <_s a'$); the authors
advocate that action ranking is more flexible and robust than a supervised
learning based approach [20], discriminating the best actions in the current
state from the other actions. The main difference with Ppl thus is that [6]
defines an order relation on the action space depending on the current state and
the current policy, whereas Ppl defines an order relation on the policy space.

2.4 Interactive optimization 

During the Ppl self-training step, the agent must find a new policy, expectedly
relevant w.r.t. the current objective function $J_t$, with the goal of finding
as fast as possible a (quasi) optimal solution policy. This same goal, cast as
an interactive optimization problem, has been tackled by [4] and [30] in a
Bayesian setting.

In [4], the motivating application is to help the user quickly find a suitable
visual rendering in an image synthesis context. The search space $X$ is made
of the rendering parameter vectors. The system displays a candidate solution,
which is ranked by the user w.r.t. the previous ones. The ranking constraints
are used to learn an objective function, represented as a Gaussian process using
a binomial probit regression model. The goal is to provide as quickly as
possible a good solution, as opposed to the optimal one. Accordingly, the
authors use the Expected Improvement over the current best solution as
optimization criterion, and they return the best vector out of a finite sample
of the search space. They further note that returning the optimal solution, e.g.
using the Expected Global Improvement criterion [17] with a branch-and-bound
method, raises technical issues on high-dimensional search spaces.

In [30], the context is that of online recommendation systems. The system
iteratively provides the user with a choice query, that is a (finite) set of
solutions $S$, of which the user selects the one she prefers. The ranking
constraints are used to learn a linear utility function $J$ on a low dimensional
search space $X = \mathbb{R}^d$, with $J(x) = \langle w, x \rangle$ and $w$ a
vector in $\mathbb{R}^d$. Within the Bayesian setting, the uncertainty about the
utility function is expressed through a belief $\theta$ defining a distribution
over the space of utility functions.

Formally, the problem of (iterated) optimal choice queries is to simultaneously
learn the user's utility function, and present the user with a set of good
recommendations, such that she can select one with maximal expected utility.
Viewed as a single-step (greedy) optimization problem, the goal thus boils down
to finding a recommendation $x$ with maximal expected utility
$\mathbb{E}_\theta[\langle w, x \rangle]$. In a global optimization perspective
however [30], the goal is to find a set of recommendations
$S = \{x_1, \ldots, x_k\}$ with maximum expected posterior utility, defined as
the expected gain in utility of the next decision. The expected utility of
selection (EUS) is studied under several noise models, and the authors show that
the greedy optimization of EUS provides good approximation guarantees of the
optimal query.

In 1^5] , the issue of the maximum expected value of information (EVOI) is 
tackled, and the authors consider the following criterion, where x* is the current 
best solution: select x maximizing 

\[ EUS(x) = \mathbb{E}_{\theta,\, x \succ x^*}[\langle w, x \rangle] + \mathbb{E}_{\theta,\, x \prec x^*}[\langle w, x^* \rangle] \qquad (4) \]

Eq. (4) thus measures the expected utility of $x$, distinguishing the case
where $x$ actually improves on $x^*$ (first term) and the case where $x^*$
remains the best solution (second term). This criterion can be understood by
reference to active learning and the so-called splitting index criterion [5].
Within the realizable setting (the solution lies in the version space of all
hypotheses consistent with all examples so far), an unlabeled instance $x$
splits the version space into two subspaces: that of hypotheses labelling $x$ as
positive, and that of hypotheses labelling $x$ as negative. The ideal case is
when instance $x$ splits the version space into two subspaces of equal size;
querying the label of $x$ thus optimally prunes the version space. In the
general case, the splitting index associated to $x$ is the relative size of the
smaller subspace: the larger, the better. In active ranking, any instance $x$
likewise splits the version space into two subspaces: the challenger subspace of
hypotheses ranking $x$ higher than the current best instance $x^*$, and its
complementary subspace. In the Bayesian setting, considering an interactive
optimization goal, the stress is put on the expected utility of $x$ on the
challenger subspace, plus the expected utility of $x^*$ on the complementary
subspace.
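
As a small numerical illustration, the sketch below estimates both quantities
from a finite sample of weight vectors standing in for the belief over utility
functions. Reading Eq. (4) as the average over the sample of the larger of the
two utilities is an assumption of this sketch (it weights each subspace by its
share of the sample); the function names and the toy sample are hypothetical.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def expected_utility_of_selection(x, x_best, w_samples):
    """Monte Carlo estimate: mean over sampled w of max(<w,x>, <w,x_best>)."""
    total = 0.0
    for w in w_samples:
        total += max(dot(w, x), dot(w, x_best))
    return total / len(w_samples)


def splitting_index(x, x_best, w_samples):
    """Relative size of the smaller of the two subspaces induced by x."""
    challengers = sum(1 for w in w_samples if dot(w, x) > dot(w, x_best))
    fraction = challengers / len(w_samples)
    return min(fraction, 1.0 - fraction)


# Hypothetical belief: two equally likely utility functions.
W_SAMPLES = [(1.0, 0.0), (0.0, 1.0)]
X_BEST = (1.0, 0.0)

EUS_OPPOSITE = expected_utility_of_selection((0.0, 1.0), X_BEST, W_SAMPLES)
EUS_MIDDLE = expected_utility_of_selection((0.5, 0.5), X_BEST, W_SAMPLES)
```

In this toy case both candidates split the sample evenly, but the candidate
that is strongly preferred by the challenger hypotheses obtains the higher
expected utility of selection.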



3 April Overview 



Like Ppl, Active Preference-based Reinforcement Learning (April) is an
iterative algorithm alternating a demonstration and a self-training phase. The
only difference between Ppl and April lies in the self-training phase. This
section first discusses the parametric and behavioral policy representations. It
thereafter presents an approximation of the expected utility of selection
criterion (AEUS) used to select the next candidate policy to be demonstrated to
the expert, which overcomes the intractability of the EUS criterion
(section 2.4) with regard to these two representations.



3.1 Parametric and behavioral policy spaces 

As mentioned, April considers two search spaces. The first one, noted $X$ and
referred to as input space or parametric space, is suitable to generate and run
the policies. In the following $X = \mathbb{R}^d$; policy $\pi_x$ is represented
by e.g. the weight vector $x$ of a neural net or the parameters of a control
pattern generator (CPG) [22], mapping the current sensor values onto the
actuator values. As mentioned, the parametric space is ill-suited to learn a
preference-based policy return, as the expert's preferences only depend on the
agent behavior and the agent behavior depends in an arbitrarily non-smooth way
on the parametric policy representation. Another space, noted $\Phi(X)$ and
referred to as feature space or behavioral space, thus needs to be considered.
Significant efforts have been made in RL to design a feature space suitable to
capture the state-reward dependency (see e.g. [H]); in IRL in particular, the
feature space encapsulates extensive prior knowledge [1]. In the considered
swarm robotics framework however, comprehensive prior knowledge is not
available, and the lack of a generative model implies that massive data are not
available either to construct an informed representation.

The proposed approach, inspired from [SJ, takes advantage of the fact that
the agent is given for free the data stream made of its sensor and actuator
values, generated along its trajectories in the environment (possibly after
unsupervised dimensionality reduction). A frugal online clustering algorithm
(e.g. $\epsilon$-means [5]) is used to define sensori-motor clusters. To each
such cluster, referred to as sensori-motor state (sms), is associated a feature.
It thus comes naturally to describe a trajectory by the fraction of overall time
it spends in every sensori-motor state.^2 Letting $D$ denote the number of sms,
each trajectory $u_x$ generated from $\pi_x$ is thus represented as a unit
vector in $[0, 1]^D$ ($\|u_x\|_1 = 1$). The behavioral representation associated
to parametric policy $x$, noted $\Phi(x)$, finally is the distribution over
$[0, 1]^D$ of all trajectories $u_x$ generated from policy $\pi_x$ (reflecting
the actuator and sensor noise, and the presence and actions of other robots in
the swarm).

Note that behavioral representation $\Phi(X)$ does not require any domain
knowledge. Moreover, it is consistent despite the fact that the agent gradually
discovers its sensori-motor space; new sms are added along the learning process
as new policies are considered, but the value of new sms is consistently set to
0 for earlier trajectories.
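
The behavioral description can be sketched in a few lines. The clustering rule
below (assign a sensori-motor sample to its nearest center, or open a new
cluster when no center lies within a radius eps) is a simple stand-in chosen for
this illustration and is not claimed to be the clustering algorithm used in the
paper; the function names, the radius and the toy trajectories are hypothetical.

```python
import math


def assign_sms(point, centers, eps):
    """Index of the nearest center within eps; otherwise open a new cluster."""
    best, best_d = None, float("inf")
    for k, center in enumerate(centers):
        d = math.dist(point, center)
        if d < best_d:
            best, best_d = k, d
    if best is None or best_d > eps:
        centers.append(tuple(point))
        return len(centers) - 1
    return best


def behavioral_description(trajectory, centers, eps):
    """Fraction of time spent in each sensori-motor state (L1 norm 1)."""
    counts = {}
    for point in trajectory:
        k = assign_sms(point, centers, eps)
        counts[k] = counts.get(k, 0) + 1
    n = len(trajectory)
    return [counts.get(k, 0) / n for k in range(len(centers))]


def pad(u, dim):
    """Earlier descriptions get value 0 on states discovered later."""
    return list(u) + [0.0] * (dim - len(u))


CENTERS = []  # grows as new sensori-motor states are discovered
TRAJ_1 = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.0, 1.05)]
TRAJ_2 = [(5.0, 5.0), (5.0, 5.0), (0.0, 0.0), (5.0, 5.0)]

U_1 = behavioral_description(TRAJ_1, CENTERS, eps=0.5)
U_2 = behavioral_description(TRAJ_2, CENTERS, eps=0.5)
U_1_PADDED = pad(U_1, len(CENTERS))
```

The second trajectory discovers a third state, so the first description is
padded with a zero to keep all descriptions in the same space.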



3.2 Approximate expected utility of selection 

Let $U_t = \{u_0, \ldots, u_t;\ (u_{i_1} \prec u_{i_2}),\ i = 1 \ldots t\}$
denote the archive of all demonstrations seen by the expert up to the $t$-th
iteration, and the ranking constraints defined from the expert preferences.
With no loss of generality, the best demonstration in $U_t$ is noted $u_t$.

In Ppl the selection of the next policy to be demonstrated was based on the
policy return $J_t(\pi_x) = \mathbb{E}_{u \sim \pi_x}[\langle w_t, u \rangle]$,
with $w_t$ the solution of problem (P) (section 2.2). By construction however,
$w_t$ is learned from the trajectories in the archive; it does not reward the
discovery of new sensori-motor states (as they are associated a weight 0 by
$w_t$). Instead of considering only the max-margin solution $w_t$, the intuition
is to consider the version space $W_t$ of all $w$ consistent^3 with the ranking
constraints in the archive $U_t$, along the same line as the expected utility of
selection (EUS) criterion [30] (section 2.4).

The EUS criterion cannot however be applied as such, since policy return $J_t$
and the version space refer to the behavioral (trajectory) space, whereas the
goal is to select an element of the parametric space; furthermore, both the
behavioral and the parametric spaces are continuous and high-dimensional. An
approximate expected utility of selection is thus defined on the behavioral and
the parametric spaces, as follows. Let $u_x$ denote a trajectory generated from
policy $\pi_x$. The expected utility of selection of $u_x$ can be defined as in
[30], as the expectation over the version space of the max between the utility
of $u_x$ and the utility of the previous best trajectory $u_t$:

\[ EUS(u_x) = \mathbb{E}_{w \in W_t}[\max(\langle w, u_x \rangle, \langle w, u_t \rangle)] \]

Specifically, trajectory $u_x$ splits version space $W_t$ into a challenger
version space noted $W_t^+$ (including all $w$ with
$\langle w, u_x \rangle > \langle w, u_t \rangle$), and its complementary
subspace $W_t^-$. The expected utility of selection of $u_x$ thus becomes:

\[ EUS(u_x) = \mathbb{E}_{w \in W_t^+}[\langle w, u_x \rangle] + \mathbb{E}_{w \in W_t^-}[\langle w, u_t \rangle] \]

The expected utility of selection of policy $\pi_x$ is naturally defined as the
expectation of $EUS(u_x)$ over all trajectories $u_x$ generated from policy
$\pi_x$:

\[ EUS(\pi_x) = \mathbb{E}_{u_x \sim \pi_x}[EUS(u_x)] = \mathbb{E}_{u_x \sim \pi_x}\Big[ \mathbb{E}_{w \in W_t^+}[\langle w, u_x \rangle] + \mathbb{E}_{w \in W_t^-}[\langle w, u_t \rangle] \Big] \qquad (5) \]

^2 The use of the time fraction is chosen for simplicity; one might use instead
the discounted cumulative time spent in every sms.

^3 While we cannot assume a realizable setting, i.e. the expert's preferences
are likely to be noisy as noted by [30], the number of ranking constraints is
always small relatively to the number $D$ of sensori-motor states. One can
therefore assume that the version space defined from $U_t$ is not empty.



Taking the expectation over all weight vectors $w$ in $W_t^+$ or $W_t^-$ is
clearly intractable as $w$ ranges in a high or medium-dimensional continuous
space. Two approximations are therefore considered, defining the approximate
expected utility of selection criterion (AEUS). The first one consists of
approximating the center of mass of a version space by the center of the largest
ball in this version space, taking inspiration from the Bayes point machine
[14]. The center of mass of $W_t^+$ (respectively $W_t^-$) is replaced by $w^+$
(resp. $w^-$), the solution of problem (P) where constraint $u_x \succ u_t$
(resp. $u_x \prec u_t$) is added to the set of constraints in archive $U_t$. As
extensively discussed by [14], the SVM solution provides a good approximation of
the Bayes point machine solution provided the dimensionality of the space is
"not too high" (more about this in section 4.1).

The second approximation takes care of the fact that the two version spaces
$W_t^+$ and $W_t^-$ are unlikely to be of equal probability (said otherwise, the
splitting index might be arbitrarily low). In order to approximate $EUS(u_x)$,
one should thus further estimate the probability of $W_t^+$ and $W_t^-$. Along
the same line, the inverse of the objective value $F(w^+)$ attained by $w^+$
(section 2.2, problem (P)) is used to estimate the probability of $W_t^+$: the
higher the objective value, the smaller the margin and the probability of
$W_t^+$. Likewise, the inverse of the objective value $F(w^-)$ attained by $w^-$
is used to estimate the probability of $W_t^-$.

Finally, the approximate expected utility of selection of a policy $\pi_x$ is
defined as:

\[ AEUS_t(\pi_x) = \mathbb{E}_{u \sim \pi_x}\Big[ \frac{1}{F(w^+)} \langle w^+, u \rangle + \frac{1}{F(w^-)} \langle w^-, u_t \rangle \Big] \qquad (6) \]
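
The criterion can be sketched for a single trajectory as follows. This is an
illustration under stated assumptions, not the implementation used in the
experiments: problem (P) is solved in its hinge form (non-negative slacks) by
plain subgradient descent, keeping the best iterate, and the values of C, the
step size and the toy archive are hypothetical. A policy-level estimate would
average aeus over several trajectories generated from the policy.

```python
def dot(w, u):
    return sum(wi * ui for wi, ui in zip(w, u))


def objective(w, pairs, c):
    """0.5 * ||w||^2 + C * hinge slacks; each pair is (worse, better)."""
    slack = sum(max(0.0, 1.0 - (dot(w, b) - dot(w, a))) for a, b in pairs)
    return 0.5 * dot(w, w) + c * slack


def solve_p(pairs, dim, c=10.0, steps=3000, lr=0.001):
    """Approximate minimizer of the ranking problem and its objective value."""
    w = [0.0] * dim
    best_w, best_f = list(w), objective(w, pairs, c)
    for _ in range(steps):
        grad = list(w)
        for a, b in pairs:
            if dot(w, b) - dot(w, a) < 1.0:  # margin violated
                for j in range(dim):
                    grad[j] -= c * (b[j] - a[j])
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
        f = objective(w, pairs, c)
        if f < best_f:
            best_w, best_f = list(w), f
    return best_w, best_f


def aeus(u, u_best, pairs, dim):
    """Single-trajectory version of the approximate criterion."""
    # Challenger case: add the constraint "u is preferred to u_best".
    w_plus, f_plus = solve_p(pairs + [(u_best, u)], dim)
    # Complementary case: add the constraint "u_best is preferred to u".
    w_minus, f_minus = solve_p(pairs + [(u, u_best)], dim)
    return dot(w_plus, u) / f_plus + dot(w_minus, u_best) / f_minus


# Hypothetical archive over three sensori-motor states: the second
# demonstration was preferred to the first and is the current best.
ARCHIVE = [((1.0, 0.0, 0.0), (0.0, 1.0, 0.0))]
U_BEST = (0.0, 1.0, 0.0)
U_NOVEL = (0.0, 0.0, 1.0)   # visits a state absent from the archive
U_KNOWN = (1.0, 0.0, 0.0)   # repeats the demonstration already ranked worse

AEUS_NOVEL = aeus(U_NOVEL, U_BEST, ARCHIVE, dim=3)
AEUS_KNOWN = aeus(U_KNOWN, U_BEST, ARCHIVE, dim=3)
```

In this toy archive the trajectory visiting a new sensori-motor state scores
higher than the one repeating a behavior already ranked below the current best,
which is the exploratory effect the criterion is meant to provide.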



3.3 Discussion 

The fact that April considers two policy representations, the parametric and the 
behavioral or feature space, aims at addressing the expressiveness/tractability 
dilemma. On the one hand, a high dimensional continuous search space is re- 
quired to express competent policies. But such high-dimensional search space 
makes it difficult to learn a preference-based policy return from a moderate 
number of rankings, keeping the expert's burden within reasonable limits. On 
the other hand, the behavioral space does make it possible to learn a
preference-based policy return from the little available evidence in the archive
(note that the dimension of the behavioral space is controlled by April),
although the behavioral description might be insufficient to describe a flexible
policy.

The price to pay for dealing with both search spaces lies in the two
approximations needed to transform the expected utility of selection (Eq. (5))
into a tractable criterion (Eq. (6)), replacing the two centers of mass of
version spaces $W_t^+$ and $W_t^-$ (i.e. the solutions of the Bayes point
machine [14]) with the solutions of the associated support vector machine
problems, and estimating the probability of these version spaces from the
objective values of the associated SVM problems.



4 Experimental results 

This section presents the experimental setting followed to validate April. Firstly, 
the performance of the approximate expected utility of selection (AEUS) crite- 
rion is assessed in an artificial setting. Secondly, the performance of April is 
assessed comparatively to inverse reinforcement learning on two RL bench- 
mark problems. 



4.1 Validation of the approximate expected utility of selection 

The artificial active ranking problem used to investigate AEUS robustness is
inspired from the active learning frame studied by [8], varying the dimension
$d$ of the space in 10, 20, 50, 100 (Fig. 1).

In each run, a target utility function is set as a vector $w^*$ uniformly
selected in the $d$-dimensional $L_2$ unit sphere. A fixed sample
$S = \{u_1, \ldots, u_{1000}\}$ of 1,000 points uniformly generated^4 in the
$d$-dimensional $L_1$ unit sphere is built. At iteration $t$, the sample $u$
with best AEUS is selected in $S$; the expert ranks it comparatively to the
previous best solution $u_t$, yielding a new ranking constraint (e.g.
$u \prec u_t$), and the process is iterated. The AEUS performance at iteration
$t$ is computed as the scalar product of $u_t$ and $w^*$.

AEUS is compared to an empirical estimate of the expected utility of selection
(baseline eEUS). In eEUS, the current best sample $u$ in $S$ is selected from
Eq. (5), computed from 10,000 points $w$ selected in the version spaces $W_t^+$
and $W_t^-$, and $u_t$ is built as above. Another two baselines are considered:
Random, where sample $u$ is uniformly selected in $S$, and Max-Coord, selecting
with no replacement $u$ in $S$ with maximal $L_\infty$ norm. The Max-Coord
baseline was found to perform surprisingly well in the early active ranking
stages, especially for small dimensions $d$. The empirical results reported in
Fig. 1 show that AEUS is a good active ranking criterion, yielding a good
approximation of EUS. The approximation degrades gracefully as dimension $d$
increases: as noted by [14], the center of the largest ball in a convex set
yields a less accurate approximation of the center of mass thereof as dimension
$d$ increases. Meanwhile, the approximation of the center of mass degrades too,
as a fixed number of 10,000 points is used to estimate EUS regardless of
dimension $d$. Random selection catches up as $d$ increases; on the contrary,
Max-Coord becomes worse as $d$ increases.

Fig. 1. Performance of AEUS comparatively to baselines (see text) vs number of
pairwise comparisons, depending on the dimension d of the search space (results
averaged over 101 runs).

^4 The samples are uniformly selected in the $d$-dimensional $L_1$ unit sphere,
to account for the fact that the behavioral representation of a trajectory has
$L_1$ norm 1 by construction (section 3.1).



4.2 Validation of April 

The main goal of the experiments is to comparatively assess April with respect
to inverse reinforcement learning (IRL [1], section 2). Both IRL and April
extract the sought policy through an iterative two-step process. The difference
is that IRL is initially provided with an expert trajectory, whereas April
receives one bit of information from the expert on each trajectory it
demonstrates (its ranking w.r.t. the previous best trajectory). April
performance is thus measured in terms of "expert sample complexity", i.e. the
number of bits of information needed to catch up with IRL. April is also
assessed and compared to the black-box CMA-ES optimization algorithm, used with
default parameters

m- 

All three algorithms, IRL, April and CMA-ES, are empirically evaluated on
two RL benchmark problems, the well-known mountain car problem and the
cancer treatment problem first introduced by [32]. None of these problems has
a reward function.

Policies are implemented as 1-hidden layer neural nets with 2 input nodes 
and 1 output node, respectively the acceleration for the mountain car (resp. the 
dosage for the cancer treatment problem). The hidden layer contains 9 neurons 
for the mountain car (respectively 99 nodes for the cancer problem), thus the 
dimension of the parametric search space is 37 (resp. 397). 
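
The reported dimensions can be recovered with a short count, assuming fully
connected layers with one bias per hidden and output unit (this bias convention
is an assumption; it is the one that reproduces the figures 37 and 397). The
function name is hypothetical.

```python
def mlp_parameter_count(n_inputs, n_hidden, n_outputs):
    """Weights plus one bias per hidden unit and per output unit."""
    hidden_params = (n_inputs + 1) * n_hidden
    output_params = (n_hidden + 1) * n_outputs
    return hidden_params + output_params


MOUNTAIN_CAR_DIM = mlp_parameter_count(2, 9, 1)
CANCER_DIM = mlp_parameter_count(2, 99, 1)
```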

RankSVM is used as learning-to-rank algorithm [16], with linear kernel and
C = 100. All reported results are averaged over 101 independent runs.

The cancer treatment problem. In the cancer treatment problem, a stochastic
transition function is provided, yielding the next patient state from its
current state (tumor size $s_t$ and toxicity level $t_t$) and the selected
action (drug dosage level $a_t$):

\[ s_{t+1} = s_t + 0.15 \max(t_t, t_0) - 1.2\,(a_t - 0.5) \times \mathbb{1}(s_t > 0) + \epsilon \]
\[ t_{t+1} = t_t + 0.1 \max(s_t, s_0) + 1.2\,(a_t - 0.5) + \epsilon \]

Further, the transition model involves a stochastic death mechanism (modelling
the possible patient death by means of a hazard rate model). The same setting
as in [6] is considered with three differences. Firstly, we considered a
continuous action space (the dosage level is a real value in [0, 1]), whereas
the action space contains 4 discrete actions in [6]. Secondly, the time horizon
is set to 12 instead of 6. Thirdly, a Gaussian noise $\epsilon$ with mean 0 and
standard deviation $\sigma$ (ranging in 0, 0.05, 0.1, 0.2) is introduced in the
transition model. The AEUS of the candidate policies is computed as their
empirical AEUS average over 11 trajectories (Eq. (6)).
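
One step of the transition model can be sketched as follows. The sketch follows
the two update equations as written, draws an independent noise term for each
equation (an assumption), and leaves out the stochastic death mechanism, whose
hazard model is not detailed here. The function name and the numerical values in
the example are hypothetical.

```python
import random


def cancer_step(tumor, tox, dose, tumor_0, tox_0, sigma, rng):
    """One transition of the tumor size / toxicity dynamics (no death model)."""
    eps_tumor = rng.gauss(0.0, sigma)
    eps_tox = rng.gauss(0.0, sigma)
    # The dosage only shrinks the tumor while the tumor size is positive.
    alive_tumor = 1.0 if tumor > 0 else 0.0
    next_tumor = (tumor + 0.15 * max(tox, tox_0)
                  - 1.2 * (dose - 0.5) * alive_tumor + eps_tumor)
    next_tox = tox + 0.1 * max(tumor, tumor_0) + 1.2 * (dose - 0.5) + eps_tox
    return next_tumor, next_tox


# Illustrative call with arbitrary values: full dosage, noise-free transition.
RNG = random.Random(0)
NEXT_STATE = cancer_step(tumor=1.3, tox=0.0, dose=1.0,
                         tumor_0=1.3, tox_0=0.0, sigma=0.0, rng=RNG)
```

With a full dosage the tumor size decreases while the toxicity level rises,
which is the trade-off the policy has to manage over the treatment horizon.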

The initial state is set to 1.3 tumor size and toxicity. For the sake of 
reproducibility, the expert preferences are emulated by favoring the trajectory
with the minimal sum of tumor size and toxicity level at the end of the 12-month
treatment.
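
The emulated expert can be sketched as a simple comparison of final states. The
trajectory representation (a list of (tumor size, toxicity) pairs) and the
function names are assumptions; ties are treated here as "no improvement", which
the text does not specify.

```python
def final_cost(trajectory):
    """Sum of tumor size and toxicity in the last state of a trajectory."""
    s, t = trajectory[-1]
    return s + t


def expert_prefers(new_traj, best_traj):
    """True if the emulated expert ranks new_traj strictly above best_traj."""
    return final_cost(new_traj) < final_cost(best_traj)
```

Each call returns a single bit of feedback, which is all the agent receives from
the expert at each iteration.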

The average performance (sum of tumor size and toxicity level) of the best
policy found in each iteration is reported in Fig. 2. It turns out that the cancer
treatment problem is an easy problem for IRL, which finds the optimal policy in
the second iteration. A tentative interpretation for this fact is that the target
behavior extensively visits the state with zero toxicity and zero tumor size; the
learned w thus associates a maximal weight to this state. In subsequent
iterations, IRL thus favors policies reaching this state as soon as possible. April
catches up after 15 iterations, whereas CMA-ES remains consistently far from
reaching the target policy in the considered number of iterations when there is
no noise, and yields poor results (not visible on the plot) for higher noise levels.




Fig. 2. The cancer treatment problem: Average performance (cumulative toxicity level
and tumor size after 12-month treatment) of April, IRL and CMA-ES versus the
number of trajectories demonstrated to the expert, for noise level 0, .05, .1 and .2.
Results are averaged over 101 runs.



The mountain car. The same setting as in [31] is considered. The car state is
described by its position and speed, initially set to position -0.5 with speed
0. The action space is set to {-1, 0, 1}, setting the car acceleration. For the sake
of reproducibility, the expert preferences are emulated by favoring the trajectory
that reaches the top of the mountain soonest, or that comes closest to the top of
the mountain at some point during the 1000 time-step trajectory.
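
One way to emulate this preference is a lexicographic key: trajectories reaching
the top are compared by arrival time, the others by their closest approach. In
the sketch below, the trajectory is assumed to be a list of positions and the
goal position is a caller-supplied parameter, since its value is not stated in
the text; the function names are hypothetical.

```python
def mountain_car_key(positions, goal):
    """Sort key for a trajectory of positions: smaller is better."""
    for step, x in enumerate(positions):
        if x >= goal:
            return (0, step)  # reached the top: earlier is better
    return (1, goal - max(positions))  # never reached: closer is better


def prefers_mountain_car(new_positions, best_positions, goal):
    """True if the emulated expert ranks the new trajectory strictly higher."""
    return mountain_car_key(new_positions, goal) < mountain_car_key(best_positions, goal)
```

Any trajectory that reaches the goal is ranked above every trajectory that does
not, whatever their closest approach.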

Interestingly, the mountain car problem appears to be more difficult for IRL
than the cancer treatment problem (Fig. 3), which we attribute to the lack of
expert features. As the trajectory is stopped when reaching the top of the
mountain and this state does not appear in the trajectory description, the target
reward would have negative weights on every (other) sms feature. IRL thus finds
an optimal policy after 7 iterations on average. As in the cancer treatment
problem, April catches up after 15 iterations, while the stochastic optimization
never catches up in the considered number of iterations.

Fig. 3. The mountain car problem: Average performance (number of time steps needed
to reach the top of the mountain) of April, IRL and CMA-ES versus the number of 
trajectories demonstrated to the expert. Results are averaged over 101 runs. 



5 Discussion and Perspectives 



The Active Preference-based Reinforcement Learning algorithm presented in this
paper combines Preference-based Policy Learning [5] with an active ranking
mechanism aimed at decreasing the number of comparison requests to the expert
needed to yield a satisfactory policy.

The lesson learned from the experimental validation of April is that very
limited external information might be sufficient to enable reinforcement learning:
while mainstream RL requires a numerical reward to be associated with each state,
and inverse reinforcement learning [1,18] requires the expert to demonstrate a
sufficiently good policy, April requires only a couple of dozen bits of information
(this trajectory improves/does not improve on the former best one) to reach
state-of-the-art results.

The proposed active ranking mechanism, inspired by recent advances in
the domain of preference elicitation [29], is an approximation of the Bayesian
expected utility of selection (EUS) criterion; on the positive side, AEUS is
tractable in high-dimensional continuous search spaces; on the negative side, it
lacks the approximate optimality guarantees of EUS.

A first research perspective concerns the theoretical analysis of the April
algorithm, specifically its convergence and robustness w.r.t. the ranking noise,
and the approximation quality of the AEUS criterion. In particular, the
computational effort of AEUS could be reduced with no performance loss by using
Bernstein races to decrease the number of empirical estimates (considered
trajectories in Eq. (6)) and confidently discard unpromising solutions [13].
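
To illustrate the idea of a Bernstein race, the sketch below uses the empirical
Bernstein bound: with probability at least 1 - delta, the empirical mean of n
samples with range R and empirical variance V deviates from its expectation by at
most sqrt(2 V ln(3/delta) / n) + 3 R ln(3/delta) / n. A candidate is discarded as
soon as its upper confidence bound falls below the lower bound of the incumbent.
The function names are hypothetical, higher values are assumed to be better, and
a full race would additionally split delta across candidates and rounds.

```python
import math


def bernstein_radius(samples, value_range, delta):
    """Half-width of the empirical Bernstein confidence interval."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n  # empirical (biased) variance
    log_term = math.log(3.0 / delta)
    return math.sqrt(2.0 * var * log_term / n) + 3.0 * value_range * log_term / n


def can_discard(candidate, incumbent, value_range, delta):
    """True if the candidate is confidently worse than the incumbent."""
    c_mean = sum(candidate) / len(candidate)
    i_mean = sum(incumbent) / len(incumbent)
    upper = c_mean + bernstein_radius(candidate, value_range, delta)
    lower = i_mean - bernstein_radius(incumbent, value_range, delta)
    return upper < lower
```

With few samples the intervals are wide and nothing is discarded; the decision
only becomes possible as more trajectories are evaluated.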

Another research perspective is related to a more involved analysis of the
expert's preferences. Typically, the expert might (dis)like a trajectory because of
some fragments of it (as opposed to the whole of it). Along this line, a
multiple-instance ranking setting [3] could be used to learn preferences at the
fragment (sub-behavior) level, thus making steps toward the definition of
sub-behaviors and modular RL.

Further work will also be concerned with hybrid policies, combining the
(NN-based) parametric policy and the model of the expert's preferences. The
idea behind such hybrid policies is to provide the agent with both reactive and
deliberative skills: while the action selection is achieved by the parametric policy
by default, the expert's preferences might be exploited to reconsider these actions
in some (discretized) sensori-motor states.

On the applicative side, April will be evaluated on large-scale robotic
problems, where designing good reward functions is notoriously difficult.